R Markdown

In statistics, exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics. We are going to use statistical graphics to visualize the results of EDA. The purpose of this article is to explore the functions of different r package to find a best data visualisation application fro IEDA.

About the Data

The Annual Data Visualization Community Survey 2019 will be used for this project. This dataset includes responses from 1350 people. We classify the dataset into 5 different categories, challenges people face, demographics data, job-related, learning-related and tool-related questions. Below is a snippet of our dataset.

Broadly, the questions will be categorized by field of topic. Five fields of topic could be identified – Job, learning, demography and tool and challenge.

Demography

For the demography part, gender, education background, major and country lived of the data visualization professionals will be checked for distribution. Bar charts will be used here.

Job

For the Job part, we can compare the proportion of each stage of data visualisation take part in a professional data visualization work. We can also check how many organizations have a data visualization team and what are the business areas data visualization supports. Stacked barcharts with errorbar will be used here.

DataViz Step-by Step

Install and Load R packages

Packages going to be used is following: tidyverse contains a set of essential packages for data manipulation and exploration. dplyr provides a function for each basic verb of data manipulation. plotly to create interactive web graphics from ‘ggplot2’ graphs. parallelPlot for easy cross-column comparison

Following codes are used to install packages.

packages <- c(“readr”, ‘tidyverse’, ‘plotly’, “ggplotify”,“parallelPlot”,“dlpr”,“dplyr”)

for (p in packages){ if (!require(p,character.only = T)){ install.packages(p) } library(p,character.only = T) }

Load the Data

to focus on several countries, filter out data about the others.

countries <- filter(data, `What country do you live in?` %in% c("USA", "Australia","Canada","Singapore","China"))

Compare the packages

In the following analysis, we are going to figure out the percentage of each education level of each country.

ggplot with faceting

With faceting, the size of barcharts are determined by the absolute number of respondents falling into each group.

ggplot(countries) +
geom_bar(mapping=aes(x = `What is your educational background`)) +
facet_wrap(~countries$`What country do you live in?`)

ggplot with dodge positioning

With fdodge positioning, it’s more convenient for us to compare the results. The size of barcharts are also determined by the absolute number of respondents falling into each group.

ountries <- filter(data, `What country do you live in?` %in% c("USA", "Australia","Canada","Singapore","China"))


p <-ggplot(countries ) + 
   geom_bar(mapping=aes(x = `What country do you live in?`,fill=`What is your educational background` ),position = "dodge")+
  coord_flip() +
  theme_minimal()



p

ggplot with data manipulation

For each country, we calculate the percentage of each educaitonal level first. Then we plot the stacked barchart with error bar. The benefit of this method is that the size of bar chart follows the percentage.

CountryNew <- countries %>% 

  group_by(`What country do you live in?`,`What is your educational background` ) %>% 
  tally() %>% 
    
    
  complete(`What is your educational background`, fill = list(n = 0)) %>% 
  mutate(percentage = n / sum(n) * 100)
    
p<-ggplot(CountryNew, aes(`What country do you live in?`, percentage, fill = `What is your educational background`)) + 
  geom_bar(stat = 'identity', position = 'dodge') +
   geom_errorbar(stat = 'identity', position = 'dodge',aes(ymin = percentage - 2* sd(percentage), ymax = percentage + 2*sd(percentage)))+ 
   coord_flip() +
    
  theme_bw()
p

ggplotly

ggplotly provides interactivity.

CountryNew <- countries %>% 

  
  


  group_by(`What country do you live in?`,`What is your educational background` ) %>% 
  tally() %>% 
    

    
  complete(`What is your educational background`, fill = list(n = 0)) %>% 
  mutate(percentage = n / sum(n) * 100)
    
p<-ggplot(CountryNew, aes(`What country do you live in?`, percentage, fill = `What is your educational background`)) + 
  geom_bar(stat = 'identity', position = 'dodge') +
   geom_errorbar(stat = 'identity', position = 'dodge',aes(ymin = percentage - 2* sd(percentage), ymax = percentage + 2*sd(percentage)))+ 
   coord_flip() +
    
  theme_bw()



 ggplotly (p)

parallelplot

one of a easiest way to compare the frequency of data in the same scale is using paeallelplot. The following example is to compare the time spending in different tasks in the way vising. It’s a pity that plotly does not support parallelplot so far.

comparison<-select(countries,c(16:21))
histoVisibility <- rep(TRUE, ncol(comparison))
parallelPlot(comparison, histoVisibility = histoVisibility)

Conclusion.

If your interest is in the absolute number and you would like to focus on certain group (country here), then you can easily use ggplopy wotoh facetting.

If your care about the percenatage, then use ggplot with datamanipulation.

Plotly is always a good way providing interativity.

one of a easiest way to compare the frequency of data in the same scale is using paeallelplot.